Chess, specifically online chess, has experienced rapid growth since the onset of the COVID-19 pandemic (Keener, 2022). With most team sports, extracurriculars, and outdoor activities shut down, people looked to the internet to pass the time. Chess proved to be the perfect outlet as it is free to play, easy to learn, and known the world over. Crucial to the success of online chess is that it pairs players up based on their rating (more on that later). This allows novices and beginners to learn the game and improve at their own pace without the discouraging experience of losing to seasoned players. Furthermore, chess, being a classic and timeless game, is accessible to people of all ages. Not only were young people drawn in by the new streaming exposure chess was getting, but experts and professionals who competed at tournaments, as well as adults who grew up playing the game, could make a swift and simple transition to an online environment. This brings me to my own experience with chess.
I learned the game at the age of four, and was initially a quick learner. I would play every Friday at my elementary school’s chess club and would periodically play against family members. Soccer, swimming, and basketball prevented chess from being my main extracurricular activity, but I maintained a loose interest in the game for many years. I would play online every now and then, but that was the extent of my relationship with chess. That was until the pandemic hit. Now that I was in college and my commitments to my other sports were all but over, I rekindled my love for the game. I began watching livestreams and YouTube videos, I started following professional tournaments, and most importantly, I was playing regularly again. My background playing the game and my age meant that I was at the intersection of the chess community’s two new groups. I was not a novice player whom chess’s new online presence had attracted, nor was I an adult or expert player who switched to online chess following the pandemic. As such, over the past few years I have occupied a middle ground in the community as an intermediate-level player. I have seen friends and family members learn the game and fall in love with it as I did, and I have learned from higher-rated players by analyzing my losses and studying professional games.
My position within the online chess community as an intermediary between beginners and experienced players is best understood through my Elo rating. Unlike many other games played at the professional level, such as poker or contract bridge, chess relies on a numerical rating system to define a player’s strength. While I will not go into depth regarding the mathematical formulas underlying the Elo rating system, the crucial concept is that a player loses or gains rating points for each game depending on the result and the rating of their opponent. Take, for example, two players: player A has a rating of 1000 and player B has a rating of 1200. If player A wins the game, they gain more rating points than player B would gain by winning, because player A would have defeated a higher-rated opponent (Elo, 1978). The same logic applies to draws. If players A and B draw their game, player A gains a small number of rating points while player B loses the same small number (far fewer than player A would gain by winning outright). This system has proven extremely accurate in estimating the strength of players from the beginner level all the way up to the top professionals since its implementation, first in the U.S. in 1960 and then worldwide in 1970 (Harkness, 1967, p. 184; Elo, 1978, p. 68). As such, loose rating brackets have been constructed to group players based on their rating. The U.S. Chess Federation (USCF) and FIDE, the international chess governing body, are separate entities, meaning a player can have both a USCF and a FIDE rating, but both use variations of the Elo rating system. The rating scales for the two organizations are shown in the charts below.
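To make the player A and player B example concrete, here is a minimal sketch of the standard Elo expected-score and update formulas. The K-factor of 32 is an illustrative assumption, as are the function names; the exact update rules used by USCF, FIDE, and Chess.com differ in their details and are not reproduced here.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def rating_change(rating: float, opponent_rating: float, score: float, k: float = 32.0) -> float:
    """Points gained (positive) or lost (negative) after one game.

    k is the K-factor; 32 is an assumed value for illustration only.
    """
    return k * (score - expected_score(rating, opponent_rating))


# Player A (1000) against player B (1200), as in the example above.
a, b = 1000, 1200
print(f"A wins: A gains {rating_change(a, b, 1.0):.1f}")
print(f"B wins: B gains {rating_change(b, a, 1.0):.1f}")
print(f"Draw:   A gains {rating_change(a, b, 0.5):.1f}")
```

Under these assumptions, the lower-rated player A gains roughly three times as many points for a win as player B would, and a draw moves a small number of points from B to A.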

| FIDE rating range | Category |
|---|---|
| 2700+ | No formal title, but sometimes informally called “super grandmasters” |
| 2700-2500 | most Grandmasters (GM) |
| 2499-2400 | most International Masters (IM) and some Grandmasters (GM) |
| 2399-2300 | most FIDE Masters (FM) and some International Masters (IM) |
| 2299-2200 | FIDE Candidate Masters (CM), most national masters (NM) |
| 2199-2000 | Candidate masters (CM) |
| 1999-1800 | Class A, category 1 |
| 1799-1600 | Class B, category 2 |
| 1599-1400 | Class C, category 3 |
| 1399-1200 | Class D, category 4 |
| 1199-1000 | Class E, category 5 |
| Below 1000 | Novices |

| USCF category | Rating range |
|---|---|
| Senior master | 2400 and up |
| National master | 2200-2399 |
| Expert | 2000-2199 |
| Class A | 1800-1999 |
| Class B | 1600-1799 |
| Class C | 1400-1599 |
| Class D | 1200-1399 |
| Class E | 1000-1199 |
| Class F | 800-999 |
| Class G | 600-799 |
| Class H | 400-599 |
| Class I | 200-399 |
| Class J | 100-199 |

Because the requirements to register a FIDE rating are more restrictive and require a player to achieve a draw against an opponent who is already rated, I will reference the USCF rating classifications throughout this project. The goal of this project is to demonstrate the accuracy of this rating system in predicting the results of online chess games and the overall strength of a player. This will be done, first, by looking at my own games, where I will examine blunder rates, opening choices, and other factors that influence game outcomes. Next, I will turn to players who are both lower and higher rated than myself in order to illustrate that the rating system remains accurate regardless of rating bracket.

| ID | Result | Rating | Opponent | OpponentRating | Color | Date | Moves | TimeControl |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1268 | EKhan_69 | 1268 | White | 11/15/2023 | 24 | 15\|10 |
| 14 | 1 | 994 | bobbyfenix | 986 | White | 11/12/2023 | 29 | 2\|1 |
| 152 | 0 | 1081 | u_mah_qween | 1078 | Black | 11/1/2023 | 56 | 5 min |
| 203 | 0 | 1220 | wcslowman | 1271 | Black | 10/21/2023 | 27 | 1 min |
| 353 | 0.5 | 1006 | yuripcastro | 978 | Black | 10/3/2023 | 65 | 3\|2 |

The “1 min” and “2|1” time controls both fall under the bullet category, “5 min” and “3|2” are both blitz, and “15|10” is rapid, so the value in the “Rating” column for each of those three groups corresponds to one of my three different Chess.com ratings (bullet, blitz, rapid). Thus, after winning my 15|10 game, only my rapid rating increased.
Now to dive into my own results. I have collected data for 400 chess games I played between October 1st and November 15th. In addition to the rating data, I have my opponent’s username, the color I was playing, the date, and the number of moves. Naturally, when playing a game where I am the higher rated player, a win is expected. Conversely, when I am the lower rated player, my opponent is favored. The plots below depict my games based on my rating and my opponent’s rating after each game is played. The blue line cutting through the plots has slope 1, so, in context, points above the line correspond to games where my opponent was higher rated whereas points below the line represent games where I was higher rated.
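The comparison described here can also be tallied directly. The sketch below counts how often the higher-rated player won, using the five sample games listed earlier as input. The function name and the category labels are my own, and because ratings are recorded after each game, the comparison is only an approximation of who was favored beforehand.

```python
def favorite_record(games):
    """Tally how often the higher-rated player won.

    Each game is (my_rating, opponent_rating, result), where result is
    1 for my win, 0.5 for a draw, and 0 for my loss.
    """
    tally = {"favorite_won": 0, "underdog_won": 0, "draw_or_even": 0}
    for my_rating, opp_rating, result in games:
        if result == 0.5 or my_rating == opp_rating:
            # Draws and equal ratings have no clear favorite-versus-result reading.
            tally["draw_or_even"] += 1
        elif (my_rating > opp_rating) == (result == 1):
            tally["favorite_won"] += 1
        else:
            tally["underdog_won"] += 1
    return tally


# (Rating, OpponentRating, Result) for the five sample rows.
sample_games = [
    (1268, 1268, 1),
    (994, 986, 1),
    (1081, 1078, 0),
    (1220, 1271, 0),
    (1006, 978, 0.5),
]
print(favorite_record(sample_games))
```

Five games are far too few to say anything about the rating system; the same tally over all 400 games is what the scatter plots summarize visually.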
As can be seen, both in the plot comprising all my games and in the plots separated by time category, there is a clear separation between the wins and losses. The losses, each represented by an “x”, are overwhelmingly located above the line whereas the wins, denoted by circles, are more heavily concentrated below the line. This reinforces the accuracy of the rating system as, more often than not, I won the games where I was higher rated and lost the games where I was lower rated. It is worth noting the lack of draws, as those who follow professional chess will know that a high percentage of top-level Grandmaster games end in draws (over 50% at classical time controls). Why then, out of 400 games, have I drawn only 10? The answer lies in the blunders.
Chess.com, the site where I play all my games, uses a powerful chess engine to evaluate positions and determine best play for both sides. A player can use this engine to analyze their moves after the game has ended. The engine assesses positions on a numeric scale where positive numbers indicate an advantage for white, negative numbers indicate an advantage for black, and 0.0 indicates equality. The engine analyzes hundreds of thousands of possible positions per second by looking at possible moves and variations for both sides. The number itself is calculated based on the standard point value assigned to each piece (pawns are 1 point, bishops and knights are each 3 points, rooks are 5 points, and the queen is 9). For example, an engine evaluation of -0.3 (a slight advantage for black) essentially means that if black plays the best move in the position, their advantage is worth roughly a third of a pawn. While the engine’s accuracy can vary depending on the complexity of a given position, an evaluation of +1.5 or -1.5 is broadly considered to be a decisive advantage.
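The point values above are enough to compute the material component of an evaluation by hand. The following sketch does only that; a real engine also weighs positional factors, so this is not how Chess.com’s engine arrives at its number. The piece-letter encoding and function names are assumptions made for illustration.

```python
# Standard point values; kings are not counted.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}


def material(pieces: str) -> int:
    """Sum the point value of one side's pieces, given as letters such as 'QRRBNPPP'."""
    return sum(PIECE_VALUES.get(piece, 0) for piece in pieces.upper())


def material_balance(white_pieces: str, black_pieces: str) -> int:
    """Positive values favor white and negative values favor black."""
    return material(white_pieces) - material(black_pieces)


# White has a rook and two pawns against a queen and two pawns.
print(material_balance("RPP", "QPP"))
```

In this example black is ahead by four points of material, which on the engine’s scale would sit well beyond the 1.5 threshold if nothing else in the position compensated for it.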
On the hierarchy of move classifications, “blunder” is right at the bottom just below “mistake”. It is a crucial error that has the potential to change the predicted outcome of the game. This can occur either from the loss of significant material, the surrender of positional control, or, simply, from allowing checkmate. Chess.com’s engine takes the point value of the position evaluation and calculates the winning percentage for both sides. Thus, every move that is not deemed best play will change the winning percentage. A blunder, by this understanding, is a move that changes the winning percentage by between 20% and 100%. Chess.com goes on to define a “mistake” as any move that changes the winning percentage by between 10% and 20% and an “inaccuracy” as a slight error between 5 and 10 percentage points. Mistakes can be consequential, particularly in crucial positions where there are only a few viable moves. Inaccuracies, however, are rarely fatal, and are typically only exploited in Master-level play. Naturally, if a game contains a lot of these errors, blunders in particular, the likelihood the result is decisive increases. The following plots represent a set of 49 games I played with a time control of 5 minutes between October 14th and October 31st. For these games, I manually collected advanced data such as number of blunders, mistakes, best moves, etc.
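The thresholds described above translate into a simple classification rule. This is a minimal sketch rather than Chess.com’s implementation: the function name is hypothetical, win percentages are taken from the perspective of the player who moved, and the treatment of values that fall exactly on a boundary (10% or 20%) is an assumption.

```python
def classify_move(win_pct_before: float, win_pct_after: float) -> str:
    """Label a move by how much it lowered the mover's winning percentage."""
    drop = win_pct_before - win_pct_after
    if drop >= 20:
        return "blunder"
    if drop >= 10:
        return "mistake"
    if drop >= 5:
        return "inaccuracy"
    return "ok"


print(classify_move(60, 35))  # a 25-point drop
print(classify_move(60, 48))  # a 12-point drop
print(classify_move(60, 54))  # a 6-point drop
```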
The first of the above plots is a stacked histogram displaying the number of games played by error count, separated by error type (Blunder, Mistake, Inaccuracy). The distribution of the three types of errors gives insight into how accurate the play is at my rating level (Class E according to the USCF classifications) for a given game. For example, it is rare for games to include more than 2 blunders. In context, this means that Class E blitz players, more often than not, are strong enough to punish the fatal errors of their opponent and convert winning positions. Mistakes and inaccuracies, however, are more evenly distributed throughout the histogram, with a notable exception at 0. We can draw several conclusions from this. Not only are Class E blitz players still prone to making several moderate errors every game, they either fail to recognize such errors when played by their opponent or are incapable of capitalizing on them. Additionally, the lack of games with 0 Mistakes or Inaccuracies combined with the graph’s right skew helps to explain the lack of draws at the Class E level. The more errors there are in a game, the more the evaluation changes, which translates to dynamic, polarized positions. Class E players are strong enough to recognize when they have winning advantages, as explained by the distribution of blunders, but are not always strong enough to execute perfectly, hence the prevalence of Mistakes and Inaccuracies. Mistakes and Inaccuracies rarely shift the balance of the game decisively, thus the player with the winning advantage has room to commit minor errors without throwing the win away. In any case, when both players are making several moderate errors a game, positions rarely even out to equality, which makes sense when considering the scale of engine evaluation. 
The “draw zone” between -1.5 and +1.5 is comparatively small when considering one player could have an advantage of more than 60 points depending on the position and material imbalance (one player is up a queen, rook, etc.). Thus, when errors pile up, draws are extremely rare.
Despite the variance and uncertainty that errors generate, the rating system still does an excellent job of predicting their occurrence. Chess.com uses skill-based matchmaking to pair up players based on their rating. For example, when looking for a game, Chess.com limits the search to players rated within 200 points of your own rating. This meant that I was playing almost exclusively against players at or near the Class E level. The second plot above depicts the distribution of total errors for both me and my opponents. While some individual counts are one-sided, the general trend indicates that my opponents and I are committing a similar number of errors. As is the case with performance metrics in team sports (points per game for football or basketball, for example), the accuracy of the Elo rating system is revealed when studying larger datasets rather than individual games. Any player can overperform or underperform in a given game and upsets can certainly happen, but as the plot illustrates, players of a similar rating commit a similar number of errors over time. This remains true even when studying blunders specifically.
The patchwork of density plots above focuses on blunder percentage, that is, the percentage of total moves made that are blunders. We use blunder percentage rather than discrete counts because it is more representative of individual game accuracy and therefore connects accuracy with results more directly. For example, one is far more likely to lose when making 2 blunders in a 30-move game than when making 2 blunders in an 80-move game. The plot on the left shows the density of games by blunder percentage for me and my opponent. The heavy overlap indicates that, on a game-to-game basis, players of a similar rating make a similar number of blunders, further reinforcing the argument made earlier. In assessing my own performance, I look at this plot and take notice of the small areas of non-overlap. In particular, I see that my opponents have slightly more games with a blunder percentage between 5% and 15% than I do, while I am more likely to have egregiously inaccurate games with a blunder percentage of over 15%. This tells me that when I am playing poorly, I have a tendency to spiral and completely lose focus. However, when I am playing well, I gain confidence, get into a groove and continue to play accurately.
The plot on the right simply offers an alternative interpretation by depicting blunder differential percentage in place of the overlapping density curves. Blunder differential percentage here is the difference between my opponent’s blunder percentage and my own. For example, if I have made one fewer blunder than my opponent over 100 moves, the blunder differential percentage is 1%. Naturally, the resulting curve is split almost evenly by the y-axis. The peak at 0 illustrates, again, how my opponent and I, in a given number of moves, commit a similar number of blunders. With this basis established, let us see how blunder differential percentage directly affects the outcome of games.
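Both metrics are simple ratios, sketched below with hypothetical function names. The sketch assumes that both players are credited with the same number of moves in a game, which is how I read the 100-move example above.

```python
def blunder_pct(blunders: int, moves: int) -> float:
    """Percentage of a player's moves that were blunders."""
    return 100 * blunders / moves


def blunder_diff_pct(my_blunders: int, opp_blunders: int, moves: int) -> float:
    """Opponent's blunder percentage minus my own; positive means I blundered less."""
    return blunder_pct(opp_blunders, moves) - blunder_pct(my_blunders, moves)


# Two blunders weigh far more heavily in a short game than in a long one.
print(blunder_pct(2, 30), blunder_pct(2, 80))

# One fewer blunder than my opponent over 100 moves.
print(blunder_diff_pct(2, 3, 100))
```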
Above is a box plot depicting blunder differential percentage for each of the three possible results, 1 being a win, 0.5 being a draw, and 0 being a loss. I include draws here not because the result is statistically significant (the sample’s single draw does little to aid prediction); rather, I want to again highlight how the prevalence of blunders at the Class E level results in more decisive games. As is shown, across the 49-game dataset that includes the advanced statistics, I drew only one blitz game. It should be no surprise that the median blunder differential is negative for my losses and positive for my wins. When I blunder less than my opponent, I am more likely to win the game. Again, individual games can be exceptions, as there are instances where I won a game in which the blunder differential was negative and vice versa. Now that we have established how blunders affect the outcome of games, I would like to analyze where my blunders come from.
“The opening,” as it is called in the chess community, is crucial to understanding the identity of a given game. While the opening is not defined by a set length, we can distinguish between the broader categories after the first 3 to 5 moves. For example, most games, but not all, that start with the move e4 for white and c5 for black are played in the “Sicilian Defense.” Thus, if I were to describe a game I won with that opening, I might say the following: “I played a very sharp Sicilian today in the Najdorf variation.” In other words, the game started with e4 and c5, later moves were made that corresponded to the Najdorf variation of the Sicilian Defense, and the game was very complicated (hence “sharp”). “Sharp” here is an appropriate adjective to describe the Sicilian Defense, as different openings can lead to different kinds of positions. The Sicilian is known to be an active defense for black that can lead to dynamic positions where blunders are common and either player can leave the opening with an advantage. Other openings, such as the Ruy Lopez (e4 e5 Nf3 Nc6 Bb5), are typically more solid and strategic in nature, with little variation in the engine evaluation. Thousands of variations are possible out of every opening, some more complex than others, but those that are favored by Masters often define the reputation of openings within the chess community.
The treemap above was constructed using data taken from Chess.com’s database of nearly 3 million games. The map is broken into four sections, each corresponding to one of the most popular first moves for white (e4, d4, Nf3 and c4). The moves inside each group correspond to a selection of black’s most popular responses. As explained previously, an opening is not always determined after move 1. Thus, several popular first moves could lead to different openings (d4 e6 being a notable example). I set the limit at one move each for this visualization as there are 71,852 possible chess positions after move two (of which several hundred are popular). The purpose of this visualization is to illustrate the general popularity of a selection of openings, which is why its scope is limited to move 1. As white I always start with d4, to which my opponent almost always responds with either Nf6 (the Indian Game) or d5 (the Closed Game). When I am black I play e5 (the Open Game or King’s Pawn Opening) in response to e4, which is the most popular first move for white. This collection of moves informs the following plot, which attempts to relate my opening choices to position complexity.
The above plot depicts the four most popular openings across my 49 blitz games where advanced data was collected: the King’s Indian (d4 Nf6 c4 g6), Petrov’s Defense (e4 e5 Nf3 Nf6), the Queen’s Gambit Declined (d4 d5 c4 e6), and the Slav Defense (d4 d5 c4 c6). The error bars extending from the top of each bar correspond to the upper bound of average blunder percentage for each opening. We use average blunder percentage to define opening complexity because if I am blundering frequently, it is more likely that the position is complex. Conversely, if playing the correct move was obvious, then the position would be relatively straightforward. Thus, the larger the error bar, the greater the variation in position complexity originating from each opening.
Two openings stand out from this plot: the King’s Indian and Petrov’s Defense. Not only is my average blunder percentage quite low when playing the King’s Indian, but the small error bar also indicates that I rarely find myself in complicated positions when playing the opening. Conversely, my average blunder percentage is much higher when I play the Petrov than when I play other openings. Additionally, the error bar is quite large, indicating that the Petrov leads to a number of different positions of varying complexity in which I seem to play quite poorly. This tells me which openings I should be playing more regularly and which I should consider dropping from my repertoire.
With knowledge of how my own rating relates to the accuracy of my play and my results, we can zoom out and analyze the accuracy of the Elo rating system across the online chess community.
The histogram above shows the distribution of accounts on Chess.com by rating classification (“Chess.com Leaderboard,” 2023). Here we use the account’s blitz rating rather than rapid or bullet to keep consistent with previous analysis. Class J players, with ratings between 100 and 199, are true beginners who are learning about captures and how the pieces move. Thus, most players do not spend a great amount of time in that rating range. The bulk of accounts are Class I, H, and G players with ratings between 200 and 799. These accounts range from novices learning elementary strategy to more experienced players who play semi-regularly. Class E, as expected, falls right in the middle of the distribution. These are players who, as I explained in the introduction, have a lengthy background playing the game but are not necessarily playing on a daily basis. These players, myself included, might play games in bursts (5-10 games every few days) when free to do so, but are not portioning out time daily to play online. This changes when we get to the higher-level class players (Class C, B, A) and Experts. These are players who play several games daily, compete in tournaments (both online and in-person) and study chess in their spare time. The distribution really thins out when we get to the Masters. Here we find the semi-professional and professional players who make solid money playing chess by winning cash prizes in online and in-person tournaments or by working as chess coaches. These players are typically studying chess for several hours a day and are playing upwards of 20-30 games online a day.
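Grouping accounts by classification amounts to a lookup against the USCF rating ranges given earlier. The sketch below shows one way to do that binning; the function names, the "Unclassified" label for ratings under 100, and the short list of example ratings (taken from figures mentioned in this project, not from the leaderboard data) are all assumptions.

```python
from collections import Counter

# Lower bound of each USCF category, highest first.
USCF_FLOORS = [
    (2400, "Senior master"), (2200, "National master"), (2000, "Expert"),
    (1800, "Class A"), (1600, "Class B"), (1400, "Class C"),
    (1200, "Class D"), (1000, "Class E"), (800, "Class F"),
    (600, "Class G"), (400, "Class H"), (200, "Class I"), (100, "Class J"),
]


def uscf_class(rating: int) -> str:
    """Return the USCF category whose range contains the rating."""
    for floor, name in USCF_FLOORS:
        if rating >= floor:
            return name
    return "Unclassified"  # below 100: not covered by the table


def class_counts(ratings):
    """Count how many ratings fall into each category."""
    return Counter(uscf_class(r) for r in ratings)


example_ratings = [1081, 994, 1220, 1006, 3317]
print(class_counts(example_ratings))
```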
We have already seen how rating and accuracy are correlated at the Class E level. Let us now explore players on either side of that rating range to determine if the rating system is consistently accurate at predicting chess strength. For the following two sets of plots, I collected data from blitz games played by two friends of mine. The first is my friend Parker. He is a slightly stronger player than myself whose blitz rating has fluctuated between Class D and C over the course of 365 games played since January 1st. The second is Liam, a slightly weaker player who has only started playing chess consistently over the past few months. Hence, his dataset consists of all 108 blitz games played on his account starting from November 8, 2021. His progression follows that of many beginners as he started out 2 years ago as a Class J player and has slowly worked his way up to Class G/F level.
The first plot we have seen before. The result of each game is denoted by a symbol, and the location of each point on either side of the line shows who the higher rated player is. We can see that there are cases where the lower rated player won the game, but it is clear that the higher rated player was victorious more often. The box plot above confirms this by comparing rating differential with result. When the result is a win, we would expect the rating differential to be positive as Parker would be the higher rated player. Conversely, when the result is a loss, we would expect the rating differential to be negative. Not only is this the case, but the rating differential for Parker’s draws falls right in between. Note also that Parker, a higher rated player playing at Class D/C level, has drawn 15 games out of 365 whereas I drew only 10 in 400. Thus, a slight increase in rating translates to slightly fewer blunders, which translates to slightly more draws. This is another example of the Elo rating system’s predictive power.
We see the same trends in Liam’s plots. Liam’s case is especially interesting as his rating has increased by over 600 points during the period analyzed. He had to win games to improve his rating, and he did have some important victories over higher rated players, but throughout his rise – both when he was a Class J player and now as a Class G/F player – the higher rated player was more likely to win. In effect, Liam was proving his strength by beating people of a similar or lower rating, but was only winning games periodically against higher rated opposition, which is the case for most players who improve gradually. Again, the box plot confirms the trends shown in the scatter plot as the rating differential is positive for Liam’s wins and negative for Liam’s losses. Additionally, Liam only drew 3 games in 108, further confirming the relationship between draws, accuracy, and rating.
The last subject for community analysis is the Norwegian Grandmaster and former World Champion Magnus Carlsen, considered by many to be the best chess player of all time. Carlsen’s FIDE ratings for classical, rapid and blitz chess are the highest in the world, and he has a Chess.com blitz rating of 3317 (“Live chess ratings,” 2023; “Chess.com Leaderboard,” 2023). In classical chess, Carlsen has not played a higher rated player since 2011, and over the 391 blitz games collected, only Hikaru Nakamura has overtaken him as Chess.com number one. As we saw earlier, there are very few accounts with blitz ratings exceeding 2400 on Chess.com and even fewer with ratings upwards of 3000. Therefore, the rating range of Carlsen’s opponents is considerably broader than it is for weaker players; otherwise, Carlsen would have to wait several hours to find an opponent. The scatter plot above accounts for this widened range by adjusting the axis limits. This adjustment necessarily alters the position of the dividing line of slope 1, as Carlsen’s rating fluctuates by only 150 points while he is playing opponents with ratings between 2500 and 3300. The few points that are located above the line correspond to games played against Hikaru Nakamura when he was the top blitz player on Chess.com. Carlsen’s victories in that match resulted in him retaking the top spot.
Another intriguing feature of Carlsen’s scatter plot is the series of points arranged in diagonal lines. This occurs as Carlsen, rather than waiting to find a new opponent after each game, arranges matches with other Grandmasters in which several games are played in succession. The slight negative slope of each of these groups shows the Elo rating system in action. Notice how each line starts with a loss and is followed by a streak of wins. The losses are positioned in this way because rating is recorded at the conclusion of each game. When Carlsen loses to lower rated opposition, his rating takes a big hit. This translates to a big shift towards the top left of the graph as Carlsen’s rating decreases greatly while his opponent’s increases sharply. These losses are followed by a series of wins for two reasons. Firstly, Carlsen is the superior player and is expected to win, and secondly, Carlsen’s victories only result in slight rating changes for him and his opponent. Therefore, the wins follow closely in succession until Carlsen loses, in which case the series starts again elsewhere in the graph.
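The asymmetry between a loss and the wins that follow it can be reproduced with the standard Elo update. This is a sketch under stated assumptions: Carlsen’s 3317 comes from the figures above, but the opponent’s rating of 3000, the K-factor of 10, and the zero-sum exchange of points are all chosen for illustration and do not describe Chess.com’s actual rating calculation.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Expected score under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def play_series(rating: float, opponent_rating: float, results, k: float = 10.0):
    """Return the (rating, opponent_rating) pair recorded after each game.

    results holds the first player's scores in order: 1 win, 0.5 draw, 0 loss.
    """
    history = []
    for score in results:
        delta = k * (score - expected_score(rating, opponent_rating))
        rating += delta            # the favorite's rating after the game
        opponent_rating -= delta   # assumed zero-sum: the opponent moves the other way
        history.append((rating, opponent_rating))
    return history


# One loss followed by three wins against the same hypothetical opponent.
for mine, theirs in play_series(3317, 3000, [0, 1, 1, 1]):
    print(f"{mine:7.1f}  {theirs:7.1f}")
```

Under these assumptions the single loss moves both ratings several times further than any one of the wins does, which is the pattern behind the diagonal clusters: a large jump after the loss, then small steps back with each win.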
Because Carlsen is the highest rated blitz player on Chess.com, his rating differential box plot looks quite different from Parker’s and Liam’s. The rating differential for his victories still exceeds the rating differential for his draws and losses, but the median value for all three is positive. Again, this is explained by the fact that Carlsen is almost never playing against higher rated players. The standout characteristic of this box plot is that the median rating differentials for Carlsen’s draws and losses are roughly equal. One might expect this value to be lower for Carlsen’s losses, but the reality is that the difference between a draw and a loss for him is minuscule. As of December 11th, 2023, Carlsen is 83 points ahead of Hikaru Nakamura atop the Chess.com blitz leaderboard. This gap is such that Carlsen, in any given game, can risk playing slightly inaccurate moves in hopes of complicating the position and confusing his opponent. Such thinking often results in Carlsen hurting his own position to create a situation where his opponent might make a mistake and surrender the advantage. Carlsen employs this strategy because an individual loss or draw does little to affect his position atop the leaderboard. In practice, such play has polarizing results. Because he is playing against other Masters, his opponents are often strong enough to either punish these “calculated errors” and win or capable enough to restore equality and draw. On the other hand, Carlsen, being the highest rated player in the world, can engineer positions of such chaos that only he can play perfectly. The result is that Carlsen draws and loses a similar number of games.
To close, analyzing my own games and those of others in the online chess community makes the accuracy of the Elo rating system evident. Not only does one’s rating relate to the quality of their moves, as was seen with the blunder analysis, but it is also a reliable predictor of game results. Beyond its function as a predictive metric, the Elo rating allows players to compare their own strength with that of their friends, peers, opponents and favorite professionals. In a similar vein, the rating system allows players to set concrete goals for themselves as they look to improve their own game. And lastly, from an administrative point of view, the Elo rating system gives organizations such as FIDE, USCF, and Chess.com a method by which they can assign titles and classifications to players, making it easier to establish concrete ranking systems and to group players for tournament play.